1.1 - What is RL for SC2?

The documentation so far has focused on building scripted agents. In that paradigm, you, the developer, explicitly write the rules (if/then/else logic) that determine the bot's behavior. Reinforcement Learning (RL) introduces a fundamentally different approach: instead of writing the rules, you define a goal, and the agent learns its own rules through trial and error.

This section will guide you through building a learned agent, where the "brain" of the bot is a neural network model trained using libraries like stable-baselines3.

Two Paradigms: Scripted vs. Learned

| Feature | Scripted Agent (Traditional Bot) | Learned Agent (RL Bot) |
| --- | --- | --- |
| Developer's Role | Write explicit rules for every scenario. | Design the learning environment, observations, actions, and reward signal. |
| Decision Making | Based on a pre-defined logic tree. | Based on a learned policy that maps observations to actions. |
| Behavior | Predictable and consistent. | Can be creative and discover non-obvious strategies. |
| Primary Challenge | Anticipating all possible game states. | Designing a reward function that leads to desired behavior; long training times. |
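
To make the difference concrete, here is a small, hypothetical sketch of the same decision ("build a worker or do nothing") made both ways. The function names and the mineral threshold are illustrative only; they are not part of python-sc2 or stable-baselines3.

```python
# Scripted: the developer hard-codes the rule.
def scripted_decide(minerals: int, supply_left: int) -> str:
    if minerals >= 50 and supply_left > 0:  # illustrative thresholds
        return "build_worker"
    return "do_nothing"

# Learned: a trained stable-baselines3 model chooses from the observation;
# its "rules" live in learned network weights, not in our source code.
def learned_decide(model, observation):
    action, _state = model.predict(observation)
    return action
```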

The Core RL Workflow

At its heart, RL is a feedback loop between an Agent and an Environment.

+------------------+            +------------------+
|                  |   Action   |                  |
|      Agent       |----------->|   Environment    |
| (Your RL Model)  |            |  (StarCraft II)  |
|                  |            |                  |
+-------^----------+            +--------v---------+
        |                                |
        |      Observation + Reward      |
        +--------------------------------+
  1. The Agent observes the state of the Environment.
  2. Based on the observation, the Agent chooses an Action.
  3. The Environment updates based on the Action and returns a new Observation and a Reward signal.
  4. The Agent uses the Reward to update its internal strategy (its Policy) to take better actions in the future, as sketched in code below.
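
In code, this loop maps directly onto the gymnasium API. The sketch below uses the built-in CartPole-v1 environment as a stand-in, since our SC2 environment does not exist yet, and random action sampling as a stand-in for a real policy:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")  # placeholder for our future SC2 environment

obs, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()   # a trained policy would choose here
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:          # episode over: start a new one
        obs, info = env.reset()
env.close()
```

Our job in the coming chapters is to replace CartPole-v1 with an environment that talks to StarCraft II through python-sc2.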

Key Terminology

| Term | Definition in our SC2 Context |
| --- | --- |
| Agent | The stable-baselines3 model we will train. This is the "brain." |
| Environment | The StarCraft II game, wrapped in a custom gymnasium interface. |
| Observation | A numerical representation of the game state (e.g., an array with minerals, supply, unit health). |
| Action | A discrete choice the Agent makes (e.g., 0 for "do nothing," 1 for "build worker"). |
| Reward | A numerical score (+1, -10, etc.) we give the Agent to reinforce good behavior and punish bad behavior. |
| Policy | The internal strategy of the Agent; essentially, a function that decides which action to take for a given observation. |
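
All of these terms map onto the gymnasium interface we will implement. As a preview, here is a hypothetical skeleton of such an environment; the class name SC2EconomyEnv, the choice of features, and the placeholder logic in reset and step are assumptions for illustration, not the final implementation:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces

class SC2EconomyEnv(gym.Env):
    """Hypothetical skeleton of the SC2 environment built in later chapters."""

    def __init__(self):
        super().__init__()
        # Observation: a small numerical summary of the game state,
        # e.g., [minerals, supply_used, supply_cap].
        self.observation_space = spaces.Box(
            low=0.0, high=np.inf, shape=(3,), dtype=np.float32
        )
        # Action: 0 = "do nothing", 1 = "build worker".
        self.action_space = spaces.Discrete(2)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        obs = np.zeros(3, dtype=np.float32)  # placeholder until SC2 is wired in
        return obs, {}

    def step(self, action):
        obs = np.zeros(3, dtype=np.float32)  # placeholder game state
        reward = 0.0                         # reward design comes later
        terminated = False                   # e.g., the game ended
        truncated = False                    # e.g., a step limit was reached
        return obs, reward, terminated, truncated, {}
```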

Our Goal for This Section

Our project is to build a bridge between python-sc2 and stable-baselines3 and to train agents to complete specific tasks (a minimal training sketch follows the task list):

  • Task 1: Create a reusable gymnasium environment for StarCraft II.
  • Task 2: Train an agent to perform a simple economic task.
  • Task 3: Train an agent to make a simple macroeconomic trade-off.
  • Task 4: Train an agent to perform a simple combat micro task.
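
As a preview of where we are headed, a training run with stable-baselines3 will look roughly like the sketch below. CartPole-v1 again stands in for our SC2 environment, and PPO with 10,000 timesteps is just one reasonable starting choice:

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # swap in our SC2 environment once it exists

model = PPO("MlpPolicy", env, verbose=1)  # the Agent: a neural-network policy
model.learn(total_timesteps=10_000)       # learn by trial and error

# The trained policy maps an observation to an action.
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)
```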

Why is RL Challenging in StarCraft II?

Applying RL to a complex game like StarCraft II is difficult for two main reasons:

  • Massive State Space: The number of possible game states is astronomically large. We must carefully select a small subset of features for our agent to observe.
  • Credit Assignment Problem: A game can last for many minutes. If you win, was it because of a good decision you made 30 seconds ago or 10 minutes ago? Designing reward functions that correctly assign credit is a major challenge; one common mitigation, reward shaping, is sketched below.
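
To make the credit assignment point concrete, compare a sparse end-of-game reward with a denser "shaped" reward for a worker-building task. Both functions are hypothetical examples, not part of any library:

```python
def sparse_reward(game_result: str) -> float:
    # One signal per game: the agent must guess which of its
    # thousands of actions actually caused the win or loss.
    return 1.0 if game_result == "Victory" else -1.0

def shaped_reward(prev_worker_count: int, curr_worker_count: int) -> float:
    # Immediate feedback for each new worker, so the agent can
    # connect a specific action to the score it produced.
    return 0.1 * (curr_worker_count - prev_worker_count)
```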

In the following chapters, we will address these challenges by starting with very simple, focused tasks.